ABSTRACT


Under a previous contract with the NASA Langley Research Center
(NASA Contract NAS1-12668), Raytheon greatly extended an existing computer
program called CARE (Computer Aided Reliability Evaluation), thereby enabling
it to calculate the reliability of any dual-mode, spare-switching computer
system. The results of that effort are described in "Reliability Model
Derivation of a Fault-Tolerant, Dual, Spare-Switching, Digital Computer
System, Final Report", 25 March 1974 (Raytheon Report No. ER74-4108).

The emphasis in that report was on the computer program itself; the
mathematical model on which the program was based was briefly outlined but
not described in detail. This document supplements the earlier report by
providing such a description, presenting some illustrative examples, and
examining the possibility of extending the computer program even further
to enable it, in particular, to accommodate computer configurations involving
more than two modes of operation.





TABLE OF CONTENTS


SECTION 

I INTRODUCTION 

II RELIABILITY MODEL DERIVATION 

III COVERAGE MODEL DERIVATION 

IV SOME EXAMPLES 

V EXTENSION TO THREE AND MORE MODES 

VI CONCLUSIONS AND RECOMMENDATIONS

VII APPENDIX 


FIGURES

NUMBER

1 CARE II RELIABILITY MODEL

2 SOFTWARE DIAGNOSTIC SCHEDULING

3 DIAGNOSTIC TEST SCHEDULES


TABLES

NUMBER

1 SINGLE-STAGE RELIABILITIES

2 NUMERICAL RELIABILITIES

3 NUMERICAL RELIABILITIES - TWO-STAGE CONFIGURATION

4 COVERAGE COEFFICIENTS





$i, j, k$           Non-negative integers; indices of summation.

$i', j'$            Biased summation indices; e.g. $i' = i + c$ with $c$
                    a constant.

$\ell$              Integer (subscript) indicating mode of operation.

$x$                 Integer (subscript) indicating computer stage (e.g.
                    CPU, I/O unit, memory module, etc.).

$Q_{x,\ell}$        Number of identical operational units needed at stage $x$
                    in mode $\ell$.

$r_x$               $r_x = 1$ if units that were active in mode 1
                    but are initially not needed in mode 2 can be treated
                    as spares; $r_x = 0$ otherwise (or if $Q_{x,2} \ge Q_{x,1}$).

$S_x$               The number of spare units initially available to stage $x$.

$t, \tau$           Time variables.

$T_s$               Time needed to test a spare at stage $x$ during mode $\ell$.

$\lambda_x$         Hazard rate for an active unit at stage $x$.

$\mu_x$             Hazard rate for a dormant unit at stage $x$.

$K_x$               $\lambda_x/\mu_x$ = dormancy factor for stage $x$.

$\lambda_i$         Rate of occurrence of "category $i$" failures; i.e.
                    failures that prevent the system from operating in
                    mode $\ell$ for any $\ell < i$, $i = 2, 3, \ldots$, but do not
                    preclude operation in any mode $\ell \ge i$.

$\gamma_x$          Rate of occurrence of transient failures at stage $x$.

$a_{x,\ell}(i,t)$   The probability that exactly $i$ of the spares available
                    at stage $x$ have been used after $t$ time units of operation
                    in mode $\ell$.

*The subscripts on these symbols are appended only when it is necessary to
distinguish explicitly between the various modes, stages, etc. Thus, for
example, the symbol $Q_{x,\ell}$ is frequently represented by simply $Q$ with the
subscripts implied.



$R(t)$              System reliability; i.e. the probability that the
                    system is still operating successfully at time $t$.

$R_\ell(t)$         Probability that the system is still operating at
                    time $t$ and that it is operating in mode $\ell$.

$R_{x,\ell}(S,t)$   The probability that stage $x$ has operated successfully
                    in mode $\ell$ for $t$ time units given that $S$ operational
                    spares were initially available.

$H_x(\tau)$         Transition probability density; i.e. probability density
                    of a failure in stage $x$ resulting in a successful
                    degeneration to mode 2 due to the lack of any remaining spares.

$S_x(t,\tau)$       Probability that stage $x$ survives until time $\tau$ in mode
                    1 and from time $\tau$ to time $t$ in mode 2.

$T_1(t)$            Probability that the system successfully enters mode 2 due
                    to a deficit of spares in some stage and (given no category
                    three failures) survives in that mode until time $t$.

$T_2(t)$            Probability that the system successfully enters mode 2
                    following a category two failure and (given no category
                    three failures) survives in that mode until time $t$.

$c_{x,\ell,i,j,k}$  The conditional probability that the system can recover
                    from a fault in stage $x$ during mode $\ell$, given that the
                    fault belongs to fault class $j$, is detected by detector
                    $i$, and that the $k$th spare for stage $x$ is the first spare
                    found to be operational.

$d_{x,\ell,j}$      Fraction of faults at stage $x$ belonging to class $j$ during
                    mode $\ell$.

$C_{x,\ell,k}$      Coverage (i.e. the conditional recovery probability)
                    for stage $x$ during mode $\ell$ given that the $k$th spare
                    is the first spare found to be operational.

$C'_{x,\ell,k}$, $\delta'_{x,\ell}$
                    The superscript ( ' ) on these terms indicates transitional
                    coverage parameters; i.e. $C'_{x,\ell,k}$ denotes the coverage in
                    stage $x$ during mode $\ell$ when none of the $k-1$ remaining spares
                    are operational and it is hence necessary to degenerate
                    to mode $\ell + 1$.




$P_{x,\ell,i,j}$    The probability that a class $j$ fault in stage $x$ during
                    mode $\ell$ would be detected by detector $i$ were it the only
                    detector operating.

$f_{x,\ell,i,j}(t)\,dt$
                    Rate at which detector $i$ would detect category $j$ faults
                    in stage $x$ during mode $\ell$ were it the only detector
                    operating.

$g_{x,\ell,i,j}(t)\,dt$
                    Rate at which detector $i$ detects category $j$ faults
                    in stage $x$ during mode $\ell$ given that all other relevant
                    detectors are also in operation.

$P'_{x,\ell,i,j}$   Probability that a category $j$ fault in stage $x$ detected
                    by detector $i$ during mode $\ell$ is successfully isolated to
                    the faulty unit.

$h_{x,\ell,i,j}(t)$ Isolation rate associated with detector $i$ following a
                    category $j$ fault in stage $x$ during mode $\ell$.

$P''_{x,\ell}$      Probability that a spare can be successfully tested
                    in stage $x$ during mode $\ell$.

$r_{x,\ell,i,j}(\tau,\tau')$
                    Probability of successful recovery in stage $x$ during
                    mode $\ell$ following a class $j$ fault detected by detector
                    $i$ when the detection and isolation times are $\tau$ and
                    $\tau'$, respectively.



I. INTRODUCTION

The reliability model implemented in CARE II (i.e., the Raytheon
extension to the CARE program) can be described with reference to Figure 1.
The computer system being modeled is assumed segregated into $n$ stages
($1 \le n \le 8$) with switchable spares separately provided for each stage.

Two modes of operation are possible. In mode 1, $Q_{i,1}$ identical units must
be functioning at stage $i$, $i = 1, 2, \ldots, n$, for the system as a whole to be
operational in that mode; in mode 2, $Q_{i,2}$ units are required at stage $i$.

The system begins operation in mode 1 (cf. Figure 1) and continues
in that mode until a failure occurs that either forces the system into mode
2 or else causes a system failure. The latter can be caused by a coverage
failure or by a category three hardware failure. This latter term includes
all those failures that, by themselves, preclude further operation regardless
of the number of functioning units at any stage. (Category three failures
are frequently referred to as single-point failures.)


Degeneration to mode 2 can be caused either by a category two hardware
failure or by spares exhaustion in any particular stage; i.e., by a failure
in any one of the $Q_{x,1}$ units needed at stage $x$ after all of the
spares for that stage have already been used. A category two failure is
one that prevents further operation in mode 1 even though a sufficient number
of units is available at each stage. (E.g., if mode 1 operation entails
the comparison of the outputs generated by two independent, parallel systems,
a failure in the comparator could constitute a category two failure.)

Similarly, the system will continue to operate in mode 2 until a coverage
failure, a category three failure, or a failure following the depletion of all
spares at the stage in question causes the whole system to fail.

Figure 1. CARE II Reliability Model

It will be noted from Figure 1 that both permanent and transient
hardware failures are modeled. Since transient failures by definition do
not permanently disable any hardware, there are only two possible outcomes
of such a failure: either the system recovers from a transient failure and
successfully resumes the application programs or it does not. The latter
case is defined as a system failure.


The fidelity with which this model determines the reliability of an
actual system is highly dependent upon the accuracy with which the various
coverage probabilities indicated in Figure 1 can be determined. That is,
given a hardware failure of a particular sort, and the availability of the
necessary spares, what is the conditional probability that the computer system
can actually recover and resume its intended function? Because of the
importance of these parameters, a coverage model was also postulated and
programmed as part of the CARE II package. This model provides a means for
calculating coverage as a function of the type of failure experienced and
whether it is permanent or transient, the number of spares
that must be tested before an operational one is found, the time delays
associated with the various fault detection and isolation mechanisms, and the
probability that a successful recovery can be achieved given that $\tau$ seconds
were needed to detect the failure and $\tau'$ seconds were needed to isolate it.

Details of the coverage model implemented in CARE II are presented in
Section III. The next section (Section II), however, first describes the CARE II
reliability model itself. Some specific simple coverage and reliability
problems are solved analytically in Section IV in an effort to illustrate
the generality and flexibility of the CARE II model. Finally, Section
V concludes with a discussion of potential further extensions of the
presently implemented reliability model, including the possibility of
expanding it to include systems capable of operating in three or more
modes.


II. Reliability Model Derivation

The CARE II reliability model is based on the assumption that each
component comprising the computer system exhibits a constant failure or
hazard rate $\lambda$. This assumption, which is almost universally made in
deriving computer reliability models, is well supported by experimental
evidence. The two major limitations to its validity are due to the "infant
mortality" phenomenon resulting in a decreasing failure rate for new, untested
parts and to wear-out mechanisms (e.g. tungsten evaporation in an incandescent
light-bulb filament) causing the failure rate to increase with age. Since
the infant mortality phenomenon can be (and, in those cases in which reliability
is of concern, presumably would be) eliminated by appropriate screen-and-burn-in
procedures, and since wear-out mechanisms are virtually nonexistent in
solid-state devices, the constant failure-rate assumption appears to be entirely
adequate for the present purpose.

If $P(t)$ denotes the fraction of elements of a particular class operating
at time $t$, then $-P'(t)\,dt$, with $P'(t)$ denoting the time derivative of $P(t)$,
indicates the fraction of elements failing in the time interval $(t, t+dt)$.

The failure rate associated with this class of elements is then:

$$\lambda(t) = -\frac{P'(t)}{P(t)} \tag{1}$$

(i.e. the fraction of presently operating elements failing per unit time).

If $\lambda(t) = \lambda$ is constant, then the solution to equation 1 (subject to the
boundary condition that $P(0) = 1$) is:

$$P(t) = e^{-\lambda t}$$

Further, if an irredundant computer unit is composed of $N$ elements having
failure rates $\lambda_i$, $i = 1, 2, \ldots, N$, then the probability that that unit is still
operating at time $t$, given that it was operating at time zero, is the probability
that all of its component parts are operating at time $t$; i.e.

$$P(t) = \prod_{i=1}^{N} e^{-\lambda_i t} = e^{-\left(\sum_{i=1}^{N} \lambda_i\right) t} = e^{-\lambda t} \tag{2}$$

with $\lambda = \sum_{i=1}^{N} \lambda_i$ simply the sum of the failure rates of the unit's
component parts.
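Equation (2) can be checked numerically with a short sketch. The element failure rates and the mission time below are hypothetical values chosen only for illustration; they do not come from the report.

```python
import math

def series_reliability(rates, t):
    """Equation (2): an irredundant unit survives to time t only if
    every one of its elements survives, so the failure rates add."""
    return math.exp(-sum(rates) * t)

# Hypothetical element failure rates (failures per hour), illustration only.
rates = [1e-5, 2e-5, 5e-6]
t = 1000.0

# Product of the individual element survival probabilities.
product_form = math.prod(math.exp(-lam * t) for lam in rates)
```

Both forms of equation (2), the product of element reliabilities and the single exponential in the summed rate, agree to within floating-point rounding.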

Now, if $Q$ of these units are needed for a given computer configuration to
function properly and if standby spares are provided which can be switched
in to replace any defective units, the probability that exactly $i$ such spares
have been required by time $t$ is simply the probability that exactly $i$ units out of
a total of $Q + i - 1$ units in the configuration have failed and that the $(Q + i)$th
unit is still functioning. To see this, assign the units used up to time $t$
consecutive numbers from 1 to $Q + i$, with units 1 through $Q$ representing the
original $Q$ operating units, unit $Q + 1$ the first spare switched in to replace
a defective unit, unit $Q + 2$ the second spare switched in, etc. Exactly
$i$ spare units will have been used by time $t$ if and only if exactly $i$ of the
units bearing the numbers $1, 2, 3, \ldots, Q + i - 1$ have failed. Note that the
$(Q + i)$th unit must still be operational since, were it not, at least $i + 1$
spares would have been required. But the probability $P(i,n)$ of exactly $i$ failures
in $n$ chances, when the probability of an individual failure is $q = 1 - p$, is given
by the well-known binomial distribution:

$$P(i,n) = \binom{n}{i} q^i p^{n-i}$$



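Consequently, the probability $a(i,t)$ that $i$ of $Q + i - 1$ units
have failed by time $t$ and that the $(Q + i)$th unit is still functioning
is $P(i, Q + i - 1)\,e^{-\lambda t}$, with $p = e^{-\lambda t}$ the probability that any one of them
has survived until time $t$; i.e.:

$$a(i,t) = \binom{Q+i-1}{i} \left(1 - e^{-\lambda t}\right)^i e^{-\lambda Q t} \tag{3}$$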
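As a sanity check on equation (3), the sketch below evaluates the probability that exactly $i$ spares have been used when all units fail at the same rate. It assumes the form $a(i,t) = \binom{Q+i-1}{i}(1-e^{-\lambda t})^i e^{-\lambda Q t}$; the function name and the numerical values are illustrative only.

```python
import math

def spares_used_prob(i, Q, lam, t):
    """Equation (3): probability that exactly i spares have been used by
    time t when Q units are needed and every unit fails at rate lam."""
    q = 1.0 - math.exp(-lam * t)          # probability a single unit has failed
    return math.comb(Q + i - 1, i) * q**i * math.exp(-lam * Q * t)

# Hypothetical values: two active units, 1e-3 failures/hour, 500 hours.
Q, lam, t = 2, 1e-3, 500.0
total = sum(spares_used_prob(i, Q, lam, t) for i in range(200))
```

With an unlimited supply of spares the events "exactly $i$ spares used" are exhaustive, so the probabilities sum to one (a negative binomial distribution), and the $i = 0$ term is just the probability $e^{-\lambda Q t}$ that none of the original units has failed.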

Note that this last conclusion implicitly assumes equal failure rates
for all $Q + i$ units. If, however, the units that are operational but are
not currently being used are placed in a dormant mode until they are needed
(e.g. the units are not powered until they are actually needed), this assumption
of identical failure rates may not be valid; the dormant units may well fail
at a rate $\mu$ significantly less than the active unit failure rate $\lambda$. This
added degree of freedom potentially complicates the expression for the
probability that exactly $i$ units are required since the probability that a
particular unit is still operational is now a function of when it was placed in
the active mode which, in turn, is determined by the number of prior failures.

It can be shown*, however, that the probability of using exactly $i$ spares
when $Q$ active units are needed and the active and standby unit failure rates are
$\lambda$ and $\mu$, respectively, is equal to the probability of using exactly $i$ spares when
$\lambda Q/\mu$ active units are needed for successful operation and the failure rate is
$\mu$ for both active and standby units. That is, from equation (3),

$$a(i,t) = \binom{KQ+i-1}{i} \left(1 - e^{-\mu t}\right)^i e^{-KQ\mu t} \tag{4}$$

with $K = \lambda/\mu$. This observation considerably simplifies the subsequent
derivations.

Note that $KQ + i - 1$ need not be an integer here. The binomial coefficient

$$\binom{KQ+i-1}{i} = \frac{(KQ+i-1)(KQ+i-2)\cdots(KQ)}{i!}$$

is still defined for noninteger values of $KQ + i - 1$ and equation (4) still
holds.

*"... Redundancy", IEEE Trans. Reliability.
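The remark about noninteger arguments can be made concrete with a small sketch. It assumes equation (4) has the form $a(i,t) = \binom{KQ+i-1}{i}(1-e^{-\mu t})^i e^{-KQ\mu t}$ with $K = \lambda/\mu$; the helper names `gen_binom` and `spares_used_prob_dormant` are introduced here for illustration and are not taken from the CARE II program.

```python
import math

def gen_binom(a, i):
    """Binomial coefficient C(a, i) = a (a-1) ... (a-i+1) / i! for real a
    and integer i >= 0; reduces to the usual value when a is an integer."""
    result = 1.0
    for j in range(i):
        result *= (a - j) / (j + 1)
    return result

def spares_used_prob_dormant(i, Q, lam, mu, t):
    """Equation (4): exactly i spares used by time t, with active failure
    rate lam, dormant failure rate mu, and dormancy factor K = lam / mu."""
    K = lam / mu
    return (gen_binom(K * Q + i - 1, i)
            * (1.0 - math.exp(-mu * t)) ** i
            * math.exp(-K * Q * mu * t))
```

When $\mu = \lambda$ (spares that are never dormant) $K = 1$ and the expression reduces to equation (3); when $\mu < \lambda$ the argument $KQ + i - 1$ is in general not an integer, which is why the product form of the coefficient is used.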

An additional complication results when the possibility of a coverage
failure is acknowledged since the possibility then exists that the system was
unable to recover even though a spare was available. Moreover, the probability
of successful recovery may well be further diminished when one or more of the
spares has already failed by the time it is needed since this presumably increases
the total time needed to recover. In the CARE II reliability model, the coverage
probability (determined using the coverage equation; cf. Section III) is
expressed in the form $C\delta^{(k-1)}$, with $k$ the number of spares that must be
tested before an operational one is found. That is, if the first spare tested
is functioning, the probability of recovery is $C$; if the first spare has failed
but the second is functioning, the recovery probability is diminished by the
factor $\delta$; if two failed spares are encountered the recovery probability
is decreased by the factor $\delta^2$, etc.

Fortunately, the effect of imperfect coverage can be included in equation
(4) simply by replacing the product $KQ$ in the binomial coefficient by the
parameter $M = KC\delta^{-1}Q$ and by multiplying the result by $\delta^i$. The resulting
probability thus becomes

$$a(i,t) = \delta^i \binom{M+i-1}{i} \left(1 - e^{-\mu t}\right)^i e^{-KQ\mu t} \tag{5}$$

A proof of the validity of this result is presented in the appendix to
this report.
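A minimal sketch of the coverage-adjusted probability follows. It assumes that equation (5) reads $a(i,t) = \delta^i \binom{M+i-1}{i}(1-e^{-\mu t})^i e^{-KQ\mu t}$ with $M = KCQ/\delta$, which is a reading of the text above rather than a transcription of the CARE II code; the function names and numbers are illustrative.

```python
import math

def gen_binom(a, i):
    """Binomial coefficient C(a, i) for real a and integer i >= 0."""
    result = 1.0
    for j in range(i):
        result *= (a - j) / (j + 1)
    return result

def spares_used_prob_cov(i, Q, lam, mu, C, delta, t):
    """Assumed form of equation (5): exactly i spares used by time t with
    recovery probability C * delta**(k-1) when k spares must be tested."""
    K = lam / mu
    M = K * C * Q / delta                 # replaces KQ in the binomial coefficient
    return (delta ** i * gen_binom(M + i - 1, i)
            * (1.0 - math.exp(-mu * t)) ** i
            * math.exp(-K * Q * mu * t))
```

Two checks are easy to make under this assumed form. With perfect coverage ($C = \delta = 1$) the expression reduces to equation (4). For $i = 1$ it gives $C\,KQ\,(1 - e^{-\mu t})\,e^{-\lambda Q t}$, i.e. the equation (4) value multiplied by the single-recovery coverage $C$, as one would expect when exactly one spare has been switched in.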


Equation (5) represents the probability that the $Q$-unit ensemble
in question survives until time $t$ given exactly $i$ hard (permanent) failures.

The CARE II model also includes the possibility of transient failures. The
coverage model (cf. Section III) provides a means of determining the
probability $P_r$ of recovering from a transient as well as from a permanent failure.

If transient failures occur at the rate $\gamma'$ failures per unit time, nonrecoverable
failures then occur at the rate $\gamma = (1 - P_r)\gamma'$. The probability $a(i,t)$ in
equation (5) must, therefore, be multiplied by the probability $e^{-\gamma t}$ that no
nonrecoverable transients occur by time $t$ in order to obtain the probability
that the $Q$-unit ensemble survives until time $t$ using $i$ spares when both hard
and transient failures are taken into account. The resulting expression
becomes:

$$a(i,t) = \delta^i \binom{M+i-1}{i} \left(1 - e^{-\mu t}\right)^i e^{-(KQ\mu + \gamma)t} \tag{6}$$
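The transient-failure correction amounts to one extra exponential factor. The sketch below computes the nonrecoverable transient rate $\gamma = (1 - P_r)\gamma'$ and the probability $e^{-\gamma t}$ that no such transient occurs by time $t$; the function names and sample values are hypothetical.

```python
import math

def nonrecoverable_transient_rate(gamma_prime, p_recover):
    """gamma = (1 - P_r) * gamma': only the transients from which the
    system fails to recover count against reliability."""
    return (1.0 - p_recover) * gamma_prime

def no_fatal_transient_prob(gamma_prime, p_recover, t):
    """Probability exp(-gamma * t) that no nonrecoverable transient has
    occurred by time t; multiplies the permanent-failure term."""
    return math.exp(-nonrecoverable_transient_rate(gamma_prime, p_recover) * t)

# Hypothetical values: 1e-2 transients/hour, 99% recovered, 100 hours.
factor = no_fatal_transient_prob(1e-2, 0.99, 100.0)
```

With these sample values $\gamma = 10^{-4}$ per hour and the factor is $e^{-0.01}$; perfect transient recovery ($P_r = 1$) leaves the permanent-failure probability unchanged.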

With these preliminaries, we can now determine the reliability
of the dual-mode computer system. First, let:

$$R(t) = R_1(t) + R_2(t) \tag{7}$$

with $R_1(t)$ denoting the probability that the system survives until time $t$
in mode 1 and $R_2(t)$ the probability that it survives in mode 2. (That
is, $R_2(t)$ is the probability that the system is still operating at time $t$
but that it had to switch to mode 2 sometime prior to time $t$.)

Let $R_{x,\ell}(S_x,t)$ be defined by the expression:

$$R_{x,\ell}(S_x,t) = \sum_{i=0}^{S_x} a_{x,\ell}(i,t)$$

with $a_{x,\ell}(i,t)$ defined in equation (6). (The subscripts $\ell$ and $x$ here denote,
respectively, mode $\ell$ and stage $x$, and $S_x$ indicates the number of spares
available at stage $x$.) Thus, $R_{x,\ell}(S_x,t)$ is the probability that stage $x$ is able
to survive in mode $\ell$ until time $t$ using no more than the $S_x$ spares provided
for it. The expression for $R_1(t)$ is then:

$$R_1(t) = \left[\prod_{x=1}^{n} R_{x,1}(S_x,t)\right] e^{-\lambda_2 t}\, e^{-\lambda_3 t} \tag{8}$$

The product over $x$ in equation (8) thus represents the probability that
all stages have survived in mode 1 with none requiring more than its
allotted number of spares. The two exponentials in equation (8) are
simply the probabilities of no category two and no category three failures,
respectively.
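The mode 1 reliability of equation (8) can be sketched by summing the per-stage probabilities over the available spares and multiplying across stages. The per-stage term below uses the assumed form $a(i,t) = \delta^i \binom{M+i-1}{i}(1-e^{-\mu t})^i e^{-(KQ\mu+\gamma)t}$ with $M = KCQ/\delta$; the dictionary layout, function names, and all numbers are hypothetical and chosen only for illustration.

```python
import math

def gen_binom(a, i):
    """Binomial coefficient C(a, i) for real a and integer i >= 0."""
    result = 1.0
    for j in range(i):
        result *= (a - j) / (j + 1)
    return result

def a_prob(i, Q, lam, mu, C, delta, gamma, t):
    """Assumed form of equation (6) for one stage."""
    K = lam / mu
    M = K * C * Q / delta
    return (delta ** i * gen_binom(M + i - 1, i)
            * (1.0 - math.exp(-mu * t)) ** i
            * math.exp(-(K * Q * mu + gamma) * t))

def stage_reliability(S, t, **params):
    """R_{x,l}(S_x, t): stage survives using no more than S spares."""
    return sum(a_prob(i, t=t, **params) for i in range(S + 1))

def mode1_reliability(stages, lam2, lam3, t):
    """Equation (8): product over stages times the probabilities of no
    category two and no category three failures."""
    product = 1.0
    for stage in stages:
        product *= stage_reliability(stage["S"], t, **stage["params"])
    return product * math.exp(-(lam2 + lam3) * t)

# Hypothetical two-stage system.
cpu = {"Q": 2, "lam": 1e-4, "mu": 1e-5, "C": 0.99, "delta": 0.9, "gamma": 1e-6}
mem = {"Q": 4, "lam": 5e-5, "mu": 5e-6, "C": 0.98, "delta": 0.9, "gamma": 0.0}
stages = [{"S": 2, "params": cpu}, {"S": 1, "params": mem}]
r1 = mode1_reliability(stages, lam2=1e-6, lam3=1e-7, t=1000.0)
```

With no spares at any stage the result collapses to a single exponential in the total failure rate, which gives a convenient check on an implementation; adding spares can only raise $R_1(t)$ because every term in the sum is non-negative.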

In order to determine $R_2(t)$ it is convenient to define some additional
terms. First, note that:

$$R_2(t) = \left[T_1(t) + T_2(t)\right] e^{-\lambda_3 t} \tag{9}$$

with $T_2(t)$ the probability that the computer system successfully enters mode 2
following a category two failure and survives in mode 2 until time $t$, $T_1(t)$ the
probability that it successfully enters mode 2 due to a spares deficit in some
stage and survives in mode 2 until time $t$, and $e^{-\lambda_3 t}$ the probability that no
category three failure occurs by time $t$.
Now let $H_x(\tau)$ be the transition probability density for stage $x$; i.e.,
the probability density of a failure in stage $x$ resulting in a successful
degeneration to mode 2 due to the absence of any remaining operational spares.
It follows that:

$$H_x(\tau) = \sum_{i=0}^{S_x} \left[Q_{x,1}\lambda_x\right] \left[a_{x,1}(S_x - i, \tau)\right] \left[\left(1 - e^{-\mu_x \tau}\right)^i\right] \left[C'_x\, {\delta'_x}^{\,i}\right] \tag{10}$$

The first bracketed factor is simply the rate at which failures afflict the
$Q_{x,1}$-unit ensemble defining stage $x$ in mode 1. The second factor represents
the probability that all but $i$ of the spares for stage $x$ have been used
by time $\tau$ and that stage $x$ is still operational at that time. The third
factor is the probability that the $i$ remaining spares have all failed by time
$\tau$. The last term is the recovery probability given that $i$ spares must be
tested and that degeneration to mode 2 is necessary. (The "primes" on the $C$ and
$\delta$ terms indicate that the transitional coverage may be different than the
coverage when no change is needed even though the same number of spares must
be tested in both cases. The coverage model described in Section III provides
a means for determining this difference by allowing the user to specify transitional
parameters.) The product of these terms, summed over all $i$ ($i = 0, 1, \ldots, S_x$),
is the desired probability density.


Further, let $S_x(t,\tau)$ be the probability that stage $x$ survives until time
$\tau$ in mode 1 and from time $\tau$ to time $t$ in mode 2. This probability can be
expressed in the form:

$$S_x(t,\tau) = \sum_{i=0}^{S_x} \sum_{j=0}^{S_x - i} \left[a_{x,1}(i,\tau)\right] \left[\binom{S_x - i}{j} e^{-j\mu_x\tau} \left(1 - e^{-\mu_x\tau}\right)^{S_x - i - j}\right] \left[R_{x,2}(j', t - \tau)\right] \tag{11}$$

The bracketed factors denote, respectively, the probability that stage $x$
survives until time $\tau$ in mode 1 using exactly $i$ spares, the probability that
exactly $j$ of the remaining $S_x - i$ spares are still operational at time $\tau$, and
the probability that stage $x$ survives for the remaining $t - \tau$ in mode 2 given
that $j'$ spares were operational at time $\tau$ (cf. equation 8). The definition
of the parameter $j'$ depends upon whether or not the $Q_{x,1} - Q_{x,2}$ active units that
were needed in mode 1 but are not required in mode 2 are reassigned to the
spares pool. If they are, $j' = j + Q_{x,1} - Q_{x,2}$; otherwise $j' = j$. The
product of these terms summed over all combinations of unused and functioning
spares yields the desired probability.*


*This expression for $S_x(t,\tau)$ assumes that all remaining spares are tested
immediately following degeneration to mode 2. This allows any defective spares
to be discarded at that time and hence decreases the probability
of a coverage failure in mode 2. If this is not done, $S_x(t,\tau)$ takes a more
involved form [equation illegible in source], in which the number of excess
active units is taken as 0 if the excess active units are not used in mode 2
and as $Q_{x,1} - Q_{x,2}$ if the excess active units are used as spares. That
expression assumes that the reassigned active units are the first to be used.


We can now express $T_1(t)$ and $T_2(t)$ in terms of previously defined
quantities:

$$T_1(t) = \sum_{x} \int_0^t \left[\prod_{y \ne x} S_y(t,\tau)\right] \left[H_x(\tau)\right] \left[R_{x,2}(r, t - \tau)\right] \left[e^{-\lambda_2 \tau}\right] d\tau \tag{12}$$

$$T_2(t) = \int_0^t \left[\prod_{y} S_y(t,\tau)\right] \left[\lambda_2 C_2\, e^{-\lambda_2 \tau}\right] d\tau$$

The first bracketed factor in the expression for $T_1(t)$ is the probability that
each of the stages comprising the computer system except stage $x$ survives in
mode 1 until time $\tau$ and in mode 2 from time $\tau$ to time $t$ (cf. equation (11)).
The second factor is the probability density of a degeneration from mode 1 to
mode 2 at time $\tau$ caused by a spares deficit in stage $x$ (see equation 10).
The third factor is the probability that stage $x$ then survives in mode 2 from
time $\tau$ to time $t$ given a total of $r$ functioning spares at time $\tau$. The
definition of the parameter $r$, like that of parameter $j'$ in equation 11,
depends on whether the $Q_{x,1} - Q_{x,2} - 1$ former active and still operational units
not needed in mode 2 are reassigned as spares. If they are, $r = Q_{x,1} - Q_{x,2} - 1$;
otherwise, $r = 0$. The last factor in the expression for $T_1(t)$ is the probability
that no category two failures occur before the degeneration to mode 2; after that
time, of course, category two failures are irrelevant. The product of these factors
then is the probability density that the system as a whole survives until time $\tau$
in mode 1 and then continues successfully in mode 2 until time $t$. Since the
degeneration can occur at any time $\tau$ in the range $(0,t)$, this product, integrated
over this entire interval, provides the desired probability $T_1(t)$.
The expression for $T_2(t)$ differs from that for $T_1(t)$ in that the
cause of the degeneration is now a category two failure; consequently the
product $H_x(\tau)\,R_{x,2}(r, t - \tau)$, representing the probability density of a
degenerative failure at stage $x$ at time $\tau$ followed by its successful operation
in mode 2 until time $t$, is replaced by $\lambda_2 C_2$, the probability density of a
recoverable category two failure at time $\tau$. The product in the expression
for $T_2(t)$ is now over all stages, since in this case, none of them is the
cause of the degeneration. That is, for successful operation through a
category two failure, all stages must operate successfully in mode 1 prior
to, and in mode 2 following, this event.


The reliability model implemented in CARE II thus determines the
reliability $R(t)$ defined in equation 7 using the intermediate quantities
defined in equations 8 and 9 through 12. As can be seen, it is a highly
versatile model, allowing arbitrary active and dormant failure rates to be
specified for each of up to eight stages with arbitrary numbers of spares
assigned to each stage. Furthermore, the concept of coverage is fully
integrated into the model with provision for specifying recovery probability
as a function of the stage in question, the number of spares that have to
be tested and the mode of operation (mode 1, mode 2, or transitional).
Unfortunately the CARE II user can rarely, if ever, be expected to be able
to assign these coverage terms with any degree of confidence. For this reason,
a coverage model was also defined and implemented as part of CARE II. This
model, described in the next section, provides a means of determining
coverage parameters in terms of more basic parameters which are presumably
more readily available to the user.


III. Coverage Model Derivation






The purpose of the coverage model is to determine the coverage
coefficients associated with the computer system as a function of
the stage afflicted by the failure, the operating mode (mode 1, mode 2,
or transitional), the type of fault (permanent or transient), and the
number of failed spares encountered in the search for a functioning
one. Although the coverage model implemented in CARE II can be used to
evaluate the coefficient $C_k$ (the coverage given $k - 1$ failed spares) for any
$k$, the use of such information would result in a somewhat more complicated
reliability model than that described in the previous section. There
it was assumed that $C_k$ was of the form $C_k = C\delta^{(k-1)}$ for all $k = 1, 2, \ldots$.
The coverage model is therefore used to determine $C_1$ and $C_2$ only, with
$\delta$ defined as the ratio $C_2/C_1$. Since $C\delta^{(k-1)}$ is presumably a reasonably
good approximation to $C_k$ for small $k$ (and is exact for $k = 1$ and 2 when
$\delta = C_2/C_1$) and since the likelihood of several consecutive failed spares
is generally small in any event, the increased computational time that
would be needed to use the more general coverage coefficients $C_k$ was not
felt to be justified.
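The geometric approximation described above is easy to tabulate. The sketch below takes two coverage values $C_1$ and $C_2$ (hypothetical numbers, not results from the report), sets $\delta = C_2/C_1$, and generates $C_k = C_1\delta^{k-1}$.

```python
def coverage_sequence(c1, c2, n):
    """Approximate coverage coefficients C_k = C_1 * delta**(k-1),
    k = 1..n, with delta = C_2 / C_1 as in the reliability model."""
    delta = c2 / c1
    return [c1 * delta ** (k - 1) for k in range(1, n + 1)]

# Hypothetical coverage values for illustration only.
seq = coverage_sequence(0.99, 0.95, 4)
```

By construction the first two entries reproduce $C_1$ and $C_2$ exactly; later entries are extrapolations and fall off geometrically with each additional failed spare encountered.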


Conceptually, coverage can be broken down into three basic components:
fault detection, fault isolation, and applications program recovery. Any
one of the following events constitutes a coverage failure: the failure
to detect the fault, the failure to isolate it to the affected unit and
to replace that unit with a functioning spare, or the inability to effect
a timely recovery of the applications program. The mechanism by which
the failure is detected presumably determines the isolation and recovery
procedures; i.e., there is a direct linkage, called a DIR mechanism, from a
detector to the isolation and recovery procedures. The relationship
between a particular fault and a particular detector, however, may be more
complex. Frequently, several different detectors may be capable of detecting
a certain fault. Which detector actually succeeds is a function of the
computer operation being carried out when the fault becomes manifest. The
CARE II coverage model provides a means of determining the probability that
a fault in any specific class or subclass* is eventually detected by a given
detector in competition with other detectors as well as the distribution of
the time delay before this detection takes place. This information is then
used in combination with user-provided statistics concerning the isolation
and recovery mechanisms associated with that detector to determine a coverage
coefficient for that detector. The summation of the probabilities of these
mutually exclusive coverage events then establishes the coverage for the fault
class in question.

The concept of a fault class is basic to the CARE II coverage model.
It cannot be assumed, for example, that if detector A detects 90% of all
faults in a given computer stage and detector B also detects 90% of these
faults, that together they detect $1 - (1 - 0.9)^2 = 99\%$ of all faults. It may
well be that they both fail to detect the same 10% of the possible faults
and hence that together they are no more effective than either is alone. In
the CARE II coverage model, this difficulty was circumvented by requiring
the user to categorize the possible faults afflicting any computer stage.
A fault class is defined as a group of faults whose possible detection
by any specific detector is statistically independent of its possible detection
by any other detector. Detectors compete against each other across fault classes;
they compete statistically independently, however, within any specific class.
In the previous example, the totality of possible faults at the stage in question,
for instance, might be divided into four disjoint classes each representing 25%
of the total. If detector A were then capable of detecting 100% of the
faults in the first two classes and 80% of the faults in the third and
fourth classes, and detector B capable of detecting 100% of the faults in
classes 1, 2 and 3 and 60% of the class 4 faults, they both would be able
individually to detect 90% of the faults as before. Together, however, they
would detect:

$$3/4 + 1/4\,(1 - 0.2 \times 0.4) = 98\%$$

of the faults. If in contrast, both detectors were 100% effective in the
first three classes and 60% effective in the fourth, so that again both
are 90% effective overall, their combined effectiveness would be:

$$3/4 + 1/4\,\left(1 - (1 - 0.6)^2\right) = 96\%$$

*The term fault class is used to denote a category of faults afflicting
a stage; a fault subclass refers to a category of faults pertaining to
a substage. For purposes of this discussion this distinction is not
important.
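The two detector examples can be reproduced with a few lines of code. The sketch below assumes, as the fault-class definition requires, that detectors act independently within a class; the function name is illustrative, and the class fractions and detection probabilities are those quoted in the example.

```python
def combined_detection(class_fractions, detector_probs):
    """Overall detection probability when detectors are statistically
    independent within each fault class.

    class_fractions: fraction of all faults falling in each class.
    detector_probs:  one list per detector giving its detection
                     probability in each class.
    """
    total = 0.0
    for frac, probs in zip(class_fractions, zip(*detector_probs)):
        miss = 1.0
        for p in probs:
            miss *= (1.0 - p)             # every detector misses the fault
        total += frac * (1.0 - miss)
    return total

fractions = [0.25, 0.25, 0.25, 0.25]
det_a = [1.0, 1.0, 0.8, 0.8]
det_b = [1.0, 1.0, 1.0, 0.6]

first_case = combined_detection(fractions, [det_a, det_b])    # about 0.98
second_case = combined_detection(fractions, [det_b, det_b])   # about 0.96
```

Each detector alone detects 90% of the faults in both cases, yet the combined effectiveness differs (98% against 96%) because it depends on how the misses are distributed over the fault classes, not only on the overall detection probabilities.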


The task of segregating faults into classes requires careful analysis
of the possible faults that can occur and of the characteristics of the
available detectors. The success with which this categorization has been
accomplished can be tested by determining, for each fault within a given
class, its probability of detection by each of the available detectors.
If this set of detection probabilities is identical for all faults in
any class, a sufficient number of classes has been defined; otherwise,
further subdivision or reclassification is needed. Although fault
classification does require detailed knowledge of the types of faults that
can occur, no coverage model can provide a meaningful measure of coverage
unless this information or its equivalent is determined.


Once the faults relevant to each computer stage are categorized, the
user must then characterize each detector $i$ for each fault class $j$ in each
stage by a detection probability, a detection rate $f(t)$ (as usual, subscripts
will be omitted in the following discussion unless they are needed for purposes
of clarity), and a scheduling rule (for software diagnostics). He must in
addition associate with each detector an isolation procedure (characterized by
an isolation probability and an isolation rate $h(t)$), and a recovery procedure
(characterized by the probability $r(\tau, \tau') = r'(\tau)\, r''(\tau + \tau')$ with $\tau$ and $\tau'$
the times needed for detection and isolation, respectively). (This form for
$r(\tau, \tau')$, although somewhat restrictive, is felt to be sufficiently general to
encompass the vast majority of recovery mechanisms. Factoring $r(\tau, \tau')$ in this
way implies that the recovery probability is the product of two terms: (1) an
error-propagation recovery probability which is a function only of the time $\tau$
during which the fault condition existed but was undetected; and (2) a
time-lost recovery probability which is a function of the total time $\tau + \tau'$
elapsed between the occurrence of the error and its isolation. The reason
for distinguishing between these two functions is that during the
error-propagation period, the effect of the still undetected error could propagate
to other computer elements, thereby complicating recovery and reducing
the probability below that resulting from consideration of the total
"down-time" $\tau + \tau'$ alone.)

On the basis of this user-supplied information, the coverage model
determines the resulting coverage C_{ijk} (when k spares have to be tested)
for each fault class j and detector i, and, for each stage, calculates
the coverage:

    C_k = \sum_j d_j \sum_i C_{ijk}    (13)

with d_j the fraction of total faults belonging to class j. That is, C_{ijk}
represents the fault class j coverage associated with the DIR mechanism i.
Since detection by detector i precludes detection by any other detector,
the summation over i of these terms represents the total coverage for fault
class j, and the weighted sum over all fault classes thus determines the
overall coverage for the stage in question.
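
The weighted sum in equation (13) can be sketched in a few lines. The function name, argument layout, and the numbers below are illustrative assumptions, not part of the CARE II program.

```python
def stage_coverage(class_fractions, class_detector_coverage):
    """Return C_k = sum_j d_j * sum_i C_ijk for one stage and one value of k.

    class_fractions[j] is d_j, the fraction of faults in class j;
    class_detector_coverage[j][i] is C_ijk for class j and detector i.
    """
    if abs(sum(class_fractions) - 1.0) > 1e-9:
        raise ValueError("the fault-class fractions d_j should sum to one")
    return sum(
        d_j * sum(per_detector)
        for d_j, per_detector in zip(class_fractions, class_detector_coverage)
    )


# Hypothetical numbers: two fault classes, two detectors per class.
d = [0.7, 0.3]
c_ijk = [[0.5, 0.4], [0.6, 0.2]]
c_k = stage_coverage(d, c_ijk)  # 0.7 * 0.9 + 0.3 * 0.8
```

The inner sum is a plain sum because, as noted above, detection by one detector precludes detection by any other.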


Let P_{ij} and P'_{ij} be, respectively, the detection and isolation
probabilities associated with detector i operating alone in the presence of a
class j fault. Let P''_j be the probability of successfully testing a
spare required as a result of a class j fault and let T_s represent the
time needed to complete this test. Finally, let g_{ij}(t) be the probability
density of the detection of a fault in class j by detector i when
the total fault detector environment is taken into account. (Thus, f_{ij}(t)
is the detection rate of the detector i when no competing detectors are
present; g_{ij}(t) the analogous function when these competing detectors are
considered.) It will be shown presently that g_{ij}(t) can be expressed
in terms of the set of density functions:

    f_{1j}(t), f_{2j}(t), ..., f_{ij}(t), ...

The coverage term can be expressed in terms of the functions and parameters
defined in the preceding paragraphs in the form:

    C_{ijk} = P_{ij} P'_{ij} (P''_j)^k \int_0^\infty \int_0^\infty r'_{ij}(\tau) g_{ij}(\tau) h_{ij}(\tau' - kT_s) r''_{ij}(\tau + \tau') d\tau' d\tau    (14)


The detection probability density function for the i-th detector is
P_{ij} g_{ij}(\tau) and the associated isolation density function is, by
definition, P'_{ij} h_{ij}(\tau'). If k spares must be checked in order to recover
successfully from a fault, the overall recovery probability is decreased by the
factor (P''_j)^k, with P''_j the probability of successfully checking out a spare,
and the isolation delay is effectively increased by the amount kT_s. The term
C_{ijk} is thus equal to the conditional probability that the system can still
recover given a \tau-second detection delay times the detection probability
density function, multiplied by the conditional probability that the system
can recover given that it has survived a \tau-second detection delay and must
in addition undergo a total of \tau + \tau' seconds down-time, times the
corresponding isolation density function, the whole thing integrated over all
\tau and \tau', and multiplied by (P''_j)^k.
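
As a rough numerical illustration of the verbal description above, the sketch below evaluates a coverage term by a midpoint-rule double integral. It assumes that equation (14) has the form "detection, isolation and spare-test probabilities times the double integral of r'(tau) g(tau) h(tau' - k T_s) r''(tau + tau')"; the function names, the uniform densities, and all numbers are hypothetical.

```python
def uniform_density(width):
    """Normalized density equal to 1/width on [0, width) and zero elsewhere."""
    return lambda t: 1.0 / width if 0.0 <= t < width else 0.0


def deadline(limit):
    """Time-lost recovery factor: recovery succeeds only within `limit` seconds."""
    return lambda total: 1.0 if total <= limit else 0.0


def coverage_term(p_det, p_iso, p_spare, k, t_spare, g, h, r1, r2, t_max, n=400):
    """Midpoint-rule estimate of the coverage integral on [0, t_max] x [0, t_max]."""
    step = t_max / n
    total = 0.0
    for a in range(n):
        tau = (a + 0.5) * step                # detection delay
        weight = r1(tau) * g(tau)
        if weight == 0.0:
            continue
        for b in range(n):
            tau_p = (b + 0.5) * step          # isolation delay, including spare tests
            total += weight * h(tau_p - k * t_spare) * r2(tau + tau_p)
    return p_det * p_iso * p_spare ** k * total * step * step


g = uniform_density(1.0)          # detection density (assumed uniform)
h = uniform_density(1.0)          # isolation density (assumed uniform)
no_propagation_loss = lambda tau: 1.0

# Generous deadline: every detected and isolated fault is recovered.
c_easy = coverage_term(0.9, 0.95, 0.98, 1, 0.5, g, h,
                       no_propagation_loss, deadline(3.0), t_max=4.0)
# Deadline shorter than the spare test alone: nothing is recovered.
c_none = coverage_term(0.9, 0.95, 0.98, 1, 0.5, g, h,
                       no_propagation_loss, deadline(0.2), t_max=4.0)
```

With a generous deadline the integral of the two normalized densities is one, so the term reduces to the product of the three probabilities; with an impossible deadline it vanishes.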


The only term in equation (14) not immediately attainable from the
information provided by the user is the conditional density function g_{ij}(\tau),
for this term is a function, not just of the detector i, but of the entire
ensemble of detectors and their interrelationships. (Note that the same
competitive relation does not exist between isolation and recovery procedures
since these procedures are uniquely determined by the outcome of the competition
among the detectors.) In order to define g_{ij}(\tau) it will be useful to
introduce some new terminology. (For notational convenience, the subscript j
denoting the fault class in question will be dropped in the ensuing discussion.)
Let T_i = n_i T be the periodicity with which software diagnostic program i is
scheduled, let n_c be the least common multiple of the n_i, and let n_c T be
defined as the major cycle. Let t_i, t_i + \Delta t_i be the start and finish
times for detector or diagnostic program i measured with respect to the
occurrence of a fault for hardware detectors, and relative to the start of a
major cycle for software diagnostics.
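
The major-cycle bookkeeping can be illustrated as follows. This is a minimal sketch; the function names are assumptions, and the periods 1, 2, 3 are those of the software-test example in Section IV.

```python
from functools import reduce
from math import gcd


def major_cycle_length(periods):
    """Least common multiple n_c of the periods n_i, in minor cycles."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods, 1)


def runs_per_major_cycle(periods):
    """Number of times n_c / n_i each diagnostic runs in one major cycle."""
    n_c = major_cycle_length(periods)
    return [n_c // n_i for n_i in periods]


periods = [1, 2, 3]                    # n_1, n_2, n_3
n_c = major_cycle_length(periods)      # major cycle is n_c * T seconds long
runs = runs_per_major_cycle(periods)
```

Each diagnostic i therefore contributes n_c / n_i executions to any quantity summed over one major cycle.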


Let t_{ij} = t_{ij}(\ell) be the maximum value (with respect to \nu) not exceeding
t_i + [term illegible in source] of the expression (t_j + \nu T_j - (\ell - 1) T_i);
that is, let t_{ij} be the time of the last occurrence of diagnostic program j in
the interval separating the (\ell - 1)st and the \ell-th occurrence of diagnostic
program i. Finally let:

    F_i(\eta) = 0                                   for \eta < 0
    F_i(\eta) = \int_0^{\eta} f_i(\xi) d\xi          for 0 \le \eta \le \Delta t_i
    F_i(\eta) = 1                                   for \eta > \Delta t_i

and let \bar{F}_i(\eta) = 1 - F_i(\eta). (Note that f_i(t) is a normalized detection
probability density; its integral over the \Delta t_i-second detection interval
is unity. Note, too, that the detection delay t_i is treated here as an explicit
parameter. Thus the detection probability density associated with the i-th
detector is of the form P_i f_i(t - t_i) with f_i(t) nonzero only over the interval
0 \le t \le \Delta t_i.)
We next observe that the detection rate g'_i(\tau) of the i-th software diagnostic
program when competing only with the other diagnostic programs is, for
0 \le \tau \le n_c T,

    [Equation (15): a sum over the n_c/n_i runs \ell in a major cycle of an
    integral over 0 \le \eta \le \Delta t_i of f_i(\eta) times a product over
    j \ne i of factors of the form 1 - P_j F_j(.); the arguments are illegible
    in the source.]

and is identically zero elsewhere. The product here is taken over all j
representing other software diagnostic programs. This expression is most
easily explained with reference to Figure 2. The cross-hatched rectangle
there indicates the time interval during which the i-th diagnostic routine is
run for the (\ell + 1)st time during a major cycle. If this routine detects
a fault which occurred exactly \tau seconds earlier, all other diagnostic
programs which were run between these two instances of time (shown as
dashed lines in Figure 2) must have failed to detect the fault. The
probability of this event is given by the product over j in equation (15). (It
is assumed here that if a fault occurs while the j-th routine is being run,
the probability that that fault is detected before the routine is concluded is

    P_j \int_t^{\Delta t_j} f_j(\eta) d\eta = P_j \bar{F}_j(t),

with t the time the fault occurred relative to the beginning of the routine
in question.) This product multiplied by the detection rate of the i-th
diagnostic routine and integrated over all \eta yields the conditional
detection rate of the i-th software diagnostic test during its (\ell + 1)st
execution in a major cycle. It remains only to sum this result over all runs
in each major cycle to obtain the desired detection rate.

The conditional detection rates for software detectors when competing with
both hardware and other software detectors can then be expressed in the form:

    g_i(\tau) = \prod_k [1 - P_k F_k(\tau - t_k)] g'_i(\tau)    (16)

with the product taken over all k corresponding to competing hardware detectors.
That is, g_i(\tau) is simply equal to g'_i(\tau) times the probability that none
of the competing hardware detectors has detected the fault first.

Similarly, the conditional detection rate of the i-th hardware detector in
the presence of its competitive hardware and software detectors is:

    g_i(\tau) = \prod_{k \ne i} [1 - P_k F_k(\tau - t_k)] [1 - \sum_j P_j \int_0^{\tau} g'_j(\eta) d\eta] f_i(\tau - t_i)    (17)

The product here, taken over all k representing competitive hardware
detectors, is again the probability that none of these detectors has
already succeeded in detecting the fault. The second bracketed term
multiplies this probability by the probability that none of the software
detectors has been successful either. (The summation is over all j
competing software detectors. Note that the g'_j(\eta) functions represent
mutually exclusive events; hence their weighted sum is indeed the probability
density function of interest.)
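
The competition between a software diagnostic and the hardware detectors can be sketched as below, assuming the software rate is the diagnostic-only rate multiplied by a product of factors 1 - P_k F_k(tau - t_k), one per hardware detector, as described in the text. The helper names, the two detector models, and the constant diagnostic-only rate are hypothetical.

```python
def uniform_cdf(width):
    """Cumulative detection function for a detector that is uniform over `width`."""
    def cdf(eta):
        if eta <= 0.0:
            return 0.0
        if eta >= width:
            return 1.0
        return eta / width
    return cdf


def impulse_cdf(eta):
    """Cumulative detection function of an impulse detector (all mass at zero)."""
    return 1.0 if eta >= 0.0 else 0.0


def hardware_miss_probability(tau, hardware):
    """Probability that no hardware detector has fired within tau seconds.

    `hardware` is a list of (P_k, F_k, t_k) triples.
    """
    prob = 1.0
    for p_k, cdf_k, t_k in hardware:
        prob *= 1.0 - p_k * cdf_k(tau - t_k)
    return prob


def software_rate(tau, g_prime, hardware):
    """Software detection rate once hardware competition is included."""
    return hardware_miss_probability(tau, hardware) * g_prime(tau)


hardware = [(0.9, impulse_cdf, 1.0), (0.5, uniform_cdf(2.0), 0.0)]
g_prime = lambda tau: 0.2            # hypothetical diagnostic-only rate
early = software_rate(0.5, g_prime, hardware)   # before the impulse detector fires
late = software_rate(1.5, g_prime, hardware)    # after it has had its chance
```

The sharp drop between the two sample points reflects the impulse detector, which removes most of the remaining undetected faults at its delay.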

These expressions for g_{ij}(\tau) (equations 16 and 17) are used in equation
(14) to determine the coverage terms C_{ijk} for all i, j, and k. These terms
are used in equation (13) to calculate a coverage probability, for each stage
and fault type, as a function of the number of spares that have to be tested.
The resulting coverage probabilities are then used in the reliability calculations
described in Section II. The examples presented in the following section are
intended to illustrate this procedure more fully.

IV. SOME EXAMPLES

We first consider two examples involving the reliability model presented
in Section II. Both examples are sufficiently simple that they can be solved
analytically; each is intended to illustrate the significance of a subset of
the various parameters defining the model.

In the first example we postulate a computer system consisting of only one stage
with S spares. Also, in order to concentrate on the parameters of greatest interest,
we assume that the category two and category three failure rates, \lambda_2 and
\lambda_3, and the rate of occurrence \gamma of nonrecoverable transient failures
are all negligible. We then have, from equation 5,

    [expression for G(i,t) illegible in source]

If we assume [conditions illegible in source], conditions which generally
hold at least approximately, we have, from equation 10,

    [expression for H_1(\tau) illegible in source]

and from equation 8,

    [expression for R_1(t) illegible in source]

and

    [expression illegible in source]

On substituting the expression for [term illegible in source] into equation 12,
we obtain:

    [expression for R_2(t) illegible in source]

Using the binomial expansion for [terms illegible in source], we can easily
carry out the integration over \tau, obtaining:

    [expression illegible in source]

Since we will be interested in the limiting case in which the dormancy
factor K \to \infty (i.e. in which \mu \to 0), we observe that G(i,t)
becomes, under these conditions,

    \lim_{K \to \infty, K\mu = \lambda} G(i,t) = \frac{(CQ\lambda t)^i}{i!} e^{-Q\lambda t}

(Note that in this event G(i,t) is independent of \delta. This is due to the fact
that with \mu = 0 the first tested spare is guaranteed to be functioning.)

When K \to \infty, \mu \to 0 (with K\mu = \lambda) the above expressions thus become:

    H_1(\tau) = Q_1 \lambda C \frac{(C Q_1 \lambda \tau)^S}{S!} e^{-Q_1 \lambda \tau}

    G_2(i, t - \tau) = \frac{(Q_2 C \lambda (t - \tau))^i}{i!} e^{-Q_2 \lambda (t - \tau)}

and

    [integral expression for R_2(t) illegible in source]

The integral in this last expression can be evaluated by again using the
binomial expansion with the result that:

    [closed-form expression for R_2(t) illegible in source]

In either of these cases, of course, the total reliability is just
R_1(t) + R_2(t).

In order to reduce these results to a more tractable form, consider the
special case in which Q_1 = 3, Q_2 = 1, and S = 2. This corresponds to the
situation in which the system begins operating in mode 1 with three active units
and two spares. It switches to mode 2 after the third failure and operates using
only one active unit. The remaining active unit that was still functioning in
mode 1 and is no longer needed in mode 2 may be either discarded (r = 0) or
reassigned to the spares pool and used in the event of a subsequent failure
(r = 1). To emphasize the relative importance of the various parameters
influencing the system reliability, we limit consideration to the following cases:
K = 1 (dormant units are as likely to fail as active units) and K = \infty (dormant
units never fail); \delta = 1 (coverage is independent of the number of failed spares
that have to be tested prior to recovery) and \delta = 0 (recovery is possible only
if no spares have failed). The resulting expressions are tabulated in Table 1
as a function of P = e^{-\lambda t}, the probability that a single unit survives until
time t, and the coverage C. Several of these expressions are then tabulated in
Table 2 as a function of P for various values of C.
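
Two of the special cases can be cross-checked against Table 2 with a short sketch. It rests on two assumptions that are not spelled out in the source: (a) with K = 1, C = 1 and \delta = 1 all five units fail independently at the same rate and every failure is recovered, so the system is still in mode 1 when at least three of the five units survive, and survives at all (with r = 1) when at least one does; (b) with K = \infty the mode 1 reliability is the truncated Poisson sum implied by the limiting form of G(i,t). The function names are hypothetical.

```python
from math import comb, factorial, log


def at_least(needed, total, p):
    """Probability that at least `needed` of `total` independent units survive."""
    return sum(
        comb(total, m) * p ** m * (1.0 - p) ** (total - m)
        for m in range(needed, total + 1)
    )


def mode1_dormant_never_fail(p, active, spares, coverage=1.0):
    """Sum over i <= S of (C Q lambda t)^i / i! times e^{-Q lambda t}, with p = e^{-lambda t}."""
    x = -active * log(p)                      # Q * lambda * t
    return p ** active * sum(
        (coverage * x) ** i / factorial(i) for i in range(spares + 1)
    )


for p in (0.2, 0.6, 0.8):
    r1_k1 = at_least(3, 5, p)                     # K = 1, C = 1, delta = 1
    total_k1 = at_least(1, 5, p)                  # same case, r = 1
    r1_kinf = mode1_dormant_never_fail(p, 3, 2)   # K = infinity, C = 1
    print(f"P={p}: R1(K=1)={r1_k1:.4f}  total(K=1, r=1)={total_k1:.4f}  "
          f"R1(K=inf)={r1_kinf:.4f}")
```

Under these assumptions the printed values can be compared directly with the R_1(t) and TOTAL (r = 1) columns of Table 2 for C = 1.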

TABLE 2

NUMERICAL RELIABILITIES
(Q_1 = 3, Q_2 = 1, S = 2)

PARAMETERS      UNIT          R_1(t)   R_2(t)   R_2(t)   TOTAL    TOTAL
                RELIABILITY            r = 0    r = 1    r = 0    r = 1
                P

C = 1           0.2           .0579    .4096    .6144    .4675    .6723
K = 1           0.4           .3174    .4752    .6048    --       --
\delta = 1      0.6           .6826    .2688    .3072    .9514    .9898
                0.8           .9421    .0544    .0576    .9965    .9997
                0.9           .9914    .0083    .0085    .9997    .9999

C = .9          0.2           .0509    .3205    .4647    .3714    .5156
K = 1           0.4           .2828    .3719    .4632    .6547    .7460
\delta = 1      0.6           .6219    .2103    .2373    .8322    .8592
                0.8           .8908    .0426    .0449    .9334    .9357
                0.9           .9622    .0065    .0066    .9687    .9688

C = 1           0.2           .0502    .1843    .2765    .2345    .3267
K = 1           0.4           .2829    .2138    .2721    .4967    .5550
\delta = 0      0.6           .6307    .1210    .1383    .7517    .7890
                0.8           .9114    .0245    .0259    .9359    .9373
                0.9           .9805    .0037    .0038    .9842    .9843

C = 1           0.2           .1399    .4212    .5924    .5611    --
K = \infty      0.4           .4817    .3755    .4887    .8572    .9704
(third          0.6           .8007    .1708    .1902    .9715    .9909
parameter       0.8           .9695    .0287    .0304    .9982    .9999
illegible)      0.9           .9958    .0040    .0041    .9998    .9999

Entries shown as -- are missing or illegible in the source.

The preceding example considered the effect of various parameters
(e.g. K, \delta, C) on the reliability of a single-stage system. We now consider a
multi-stage system but with some of these parameters restricted in order to
keep the analysis reasonably tractable. In particular, we assume that each of
the n stages has the same failure rate \lambda, that Q_{x1} = 2, Q_{x2} = 1,
S_x = 1, and \mu_x = \lambda_x = \lambda for all stages. As before, we also assume
that \gamma = \lambda_2 = \lambda_3 = 0.


Under the conditions just specified, equation 8 becomes

    R_1(t) = P^{2n} [1 + 2C(1 - P)]^n

with P = e^{-\lambda t} the reliability of a single stage in the n-stage system. It
is also readily verified that equations 10, 11, and 8 become, respectively,

    H_{x1}(\tau) = 2\lambda C (2C + \delta)(1 - e^{-\lambda\tau}) e^{-2\lambda\tau}

    [two further expressions, of the form a_0 + a_1 e^{(.)} + a_2 e^{(.)} with
    coefficients a_0, a_1, a_2 depending on r, C, \delta, and P; illegible in
    source]

where r = 1 if the units which were active in mode 1 and are still functioning
are to be used as spares in mode 2 and r = 0 otherwise. (It is assumed that
either r = 1 for all stages, or r = 0 for all stages.) Thus, from equation 12,
since all stages have identical parameters,

    [integral expression for R_2(t) illegible in source]

with b = 2C(2C + \delta). Using the trinomial expansion

    (a_0 + a_1 x + a_2 x^2)^{n-1} = \sum_{i+j+k=n-1} \frac{(n-1)!}{i! j! k!} a_0^i a_1^j a_2^k x^{j+2k}

and carrying out the integration, we obtain:

    [closed-form expression for R_2(t) illegible in source]

When n = 2, for example, we find from the above that:

    R(t) = R_1(t) + R_2(t)
         = P^4 [1 + 2C(1 - P)]^2 + [remaining terms, a polynomial in P with
           coefficients involving C, \delta, and d; illegible in source]

with d = [expression illegible in source].

This result is tabulated in Table 3 as a function of P^2, the probability
that a two-stage nonredundant system would have survived until time t, for
various values of the parameters C, \delta, and r.
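
The mode 1 column of Table 3 can be reproduced from the closed form for R_1(t), read here as R_1(t) = P^{2n} [1 + 2C(1 - P)]^n with P^n the reliability of the nonredundant n-stage system. That reading of the garbled equation is an assumption, as is the function name below.

```python
def mode1_reliability(p_nonredundant, n_stages, coverage):
    """R_1 for n identical stages, each with two active units and one spare.

    p_nonredundant is P**n, the reliability of the n-stage nonredundant system.
    """
    p_unit = p_nonredundant ** (1.0 / n_stages)          # P = e^{-lambda t}
    per_stage = p_unit ** 2 * (1.0 + 2.0 * coverage * (1.0 - p_unit))
    return per_stage ** n_stages


for coverage in (1.0, 0.9):
    row = [round(mode1_reliability(p, 2, coverage), 4)
           for p in (0.2, 0.4, 0.6, 0.8)]
    print(f"C = {coverage}: {row}")
```

The per-stage factor is the probability that a stage has had no failure, plus the probability that it has had exactly one covered failure absorbed by its single spare.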

TABLE 3

NUMERICAL RELIABILITIES
TWO-STAGE CONFIGURATION
(Q_2 = 1, Q_1 = 2, S = 1)

PARAMETERS      NONREDUNDANT   R_1(t)   R_2(t)   R_2(t)   TOTAL    TOTAL
                SYSTEM                  r = 0    r = 1    r = 0    r = 1
                RELIABILITY

C = 1           0.2            .1773    .4387    .513?    .6160    .6901
\delta = 1      0.4            .4817    .3923    .4215    .8740    .9032
                0.6            .7577    .2133    .2195    .9701    .9772
                0.8            .9388    .0585    .0588    .9973    .9976
                0.9            .9848    .0149    .0149    .9997    .9997

C = 0.9         0.2            .1592    .3511    .4040    .5103    .5632
\delta = 1      0.4            .4417    .3171    .3378    .7588    .7795
                0.6            .7114    .1743    .1787    .8857    .8901
                0.8            .9064    .0484    .0486    .9548    .9550
                0.9            .9666    .0124    .0124    .9790    .9790

C = 1           0.2            .1773    .2925    .3360    .4698    .5133
\delta = 0      0.4            .4817    .2615    .2786    .7432    .7603
                0.6            .7577    .1422    .1458    .8999    .9035
                0.8            .9388    .0390    .0392    .9778    --
                0.9            .9848    .0099    .0100    .9947    .9948

Entries shown as -- are missing in the source; ? marks an illegible digit.
To illustrate the use of the coverage model, consider a set of three
software tests and two hardware detectors designed to detect faults in a
given category. Let the three software tests have detection probabilities
P_i, i = 1, 2, 3, and detection rates,

    f_i(\tau) = 1/\Delta t_i     for 0 \le \tau \le \Delta t_i
    f_i(\tau) = 0                otherwise

for i = 1, 2, 3. Further, let test 1 be run every minor cycle (i.e. every T
seconds), test 2 every other minor cycle, and test 3 every third minor cycle,
and let them be scheduled as shown in Figure 3. Then, with reference to
equation 15, we have n_1 = 1, n_2 = 2, n_3 = 3, n_c = lcm(1, 2, 3) = 6. It is
convenient to assume that the time separation between the i-th and j-th
software tests exceeds the duration of both (i.e. that t_{ij} > \Delta t_i +
\Delta t_j) since this condition considerably simplifies the resulting
expressions for g'_i(\tau).

FIGURE 3. DIAGNOSTIC TEST SCHEDULES [figure not reproduced]

Making this assumption, defining

    [piecewise definition of P_{ij}(\tau) illegible in source]

and letting P'_{ij}(\tau) be similarly defined but with t_{ij} replaced everywhere
by t_{ij} + T, we can carry out the integration in equation 15 obtaining:

    [expressions for g'_1(\tau), g'_2(\tau), and g'_3(\tau), nonzero on
    0 < \tau < T, 0 < \tau < 2T, and 0 < \tau < 3T respectively and zero
    elsewhere; illegible in source]

Now suppose the two hardware detectors can both be modeled as impulse
detectors, one with a detection delay of t_4 seconds and the other with a
t_5-second delay. It then follows from equation 16 that:

    [expressions for g_i(\tau) illegible in source]

with \delta(t) the Dirac delta function.

Suppose further that the isolation procedure associated with each of these
detectors requires a fixed time (i.e. each isolation rate h_i(\tau) is a Dirac
delta function at that delay, for all i), that the amount of time needed to
test a spare is likewise fixed, and that total recovery is possible if and only
if the fault is both detected and isolated within a fixed time limit of its
occurrence (i.e. the recovery probability is a unit step function u of the time
remaining before that limit, with u(\tau) = 0, \tau < 0; u(\tau) = 1, \tau > 0).
The symbols used for these three times are illegible in the source. Then, from
equation 14, we have:

    [expression for the coverage terms illegible in source]

Although we could continue with this example at the present level of generality,
results are more readily interpretable if we add some additional constraints. In
particular,

    [constraints illegible in source]

The above expression for the coverage terms then becomes:

    [expression illegible in source]
Numerical values of these coverage coefficients are tabulated in Table 4 as
a function of the probability P that any one of the detectors would by itself
eventually detect the fault in the absence of any competition.

TABLE 4

COVERAGE COEFFICIENTS

P      k    C_1k     C_2k     C_3k     C_4k     C_5k     C_tot     \delta = C_2/C_1

1      1    .2500    .1250    .0833    .5417    0        1.0000
1      2    .2500    .1250    .0833    .5417    0        1.0000    1.0000
.99    1    .2495    .1242    .0831    .5408    .0025    1.0000
.99    2    .2494    .1242    .0831    .5408    .0025    1.0000    1.0000
.95    1    .2474    .1210    .0819    .5364    .0133    1.0000
.95    2    .2472    .1209    .0819    .5364    .0133    0.9997    0.9997
.90    1    .2452    .1172    .0806    .5288    .0279    .9997
.90    2    .2443    .1171    .0805    .5288    .0279    .9986     0.9989
.80    1    .2415    .1106    .0779    .5067    .0597    .9964
.80    2    .2380    .1101    .0776    .5067    .0597    .9921     0.9957
.60    1    .2327    .1013    .0727    .4350    .1224    .9641
.60    2    .2193    .0974    --       .4350    .1224    --        --
.40    1    .2087    .0917    .0647    .3267    .1584    .8502
.40    2    --       .0823    .0590    .3267    .1584    .8102     .9530
.20    1    .1440    .0670    .0458    .1817    .1317    .5702
.20    2    .1171    .054?    .0381    .1817    .1317    --        --

Entries shown as -- are missing or illegible in the source; ? marks an
illegible digit.
Several observations can be made concerning this particular example.
It will be noted, for example, that the hardware detector having the T/4-second
detection delay is generally the most effective device although its advantage
decreases with decreasing P. The relative effectiveness of the software
diagnostic programs is highly correlated to the frequency with which these programs
are run and is comparatively independent of the order in which they are scheduled.
These conclusions, however, are strongly influenced by the fact that the detection
and isolation procedures must be completed in a relatively short time in this
example for the recovery process to be successful.

V. EXTENSION TO THREE AND MORE MODES

The CARE II reliability model can readily be extended to include three
or more modes of operation although the complexity of the resulting analytical
expressions increases correspondingly. (The CARE II coverage model allows
coverage to be determined as a function of the mode of operation, so it already
is sufficient to model an arbitrary number of modes.) This increased complexity
is due in part to the fact that two or more mode changes are now possible and
in part to the increased number of ways in which this sequence of mode changes
can be instigated. (E.g. a spares depletion in stage i could cause a change
from mode 1 to mode 2 followed by a spares depletion in stage j causing a
degeneration to mode 3, with both i = j and i \ne j as distinct possibilities.)
Nevertheless, the appropriate reliability expressions are straightforward
extensions of those described in Section II.

Consider first the case of three modes. In analogy with the category
two and category three failures defined earlier, let a category three failure
be redefined as a failure that, by itself, prevents operation in mode 1 or
2 but not in mode 3, and let a category four failure be one precluding all
modes (i.e., a single-point failure for the three-mode system). Then,
the reliability of the three-mode configuration can be written:

    R(t) = (R_1(t) + R_2(t) + R_3(t)) e^{-\lambda_4 t}    (18)

with R_1(t) and R_2(t) exactly as defined in Section II, R_3(t) the probability
that the system survives until time t having degenerated to mode 3 sometime
prior to t, and e^{-\lambda_4 t} the probability that no category four failures,
occurring at a rate \lambda_4 failures per unit time, have taken place by time t.










Since expressions for R_1(t) and R_2(t) have already been derived it
remains only to determine R_3(t). To this end we observe that there are six
mutually exclusive categories of fault sequences that can result in a
degeneration to mode 3: (1) A fault in stage x causes degeneration to mode 2
and a subsequent fault in stage z causes degeneration to mode 3, with x \ne z.
(2) The previous fault sequence takes place but with x = z. (3) A category
two fault causes degeneration to mode 2 and a subsequent failure in stage x
causes degeneration to mode 3. (4) A failure in stage x causes degeneration
to mode 2 and a category three failure causes degeneration to mode 3. (5)
A category two failure causes degeneration to mode 2 and a category three
failure causes degeneration to mode 3. (6) A category three failure forces
the system to degenerate directly from mode 1 to mode 3. (It is implicitly
assumed by this verbal description that the number of operational units
required for any stage in mode i is at least one greater than that required for
operation in mode j for i < j. Thus, no single failure in any stage can by
itself cause degeneration from mode 1 to mode 3. The expressions to be derived,
however, will remain valid even if this assumption is violated provided the
proper limiting conditions are observed.)
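
The six categories listed above follow from pairing a cause for the first mode change with a cause for the second, splitting the stage-then-stage case by whether the same stage is involved, and adding the direct transition. A small enumeration makes the count explicit; the labels and the function name are illustrative only.

```python
def mode3_fault_sequences():
    """Enumerate the fault sequences that end in mode 3, as described in the text."""
    first_causes = ["spares exhausted in stage x", "category two failure"]
    second_causes = ["spares exhausted in a stage", "category three failure"]
    sequences = []
    for first in first_causes:
        for second in second_causes:
            if first.startswith("spares") and second.startswith("spares"):
                # Stage-then-stage splits into two cases: z != x and z == x.
                sequences.append((first, "spares exhausted in stage z != x"))
                sequences.append((first, "spares exhausted in stage x again"))
            else:
                sequences.append((first, second))
    # A category three failure takes the system from mode 1 straight to mode 3.
    sequences.append(("category three failure (direct)", None))
    return sequences


for number, (first, second) in enumerate(mode3_fault_sequences(), start=1):
    print(number, first, "->", second if second else "(no second change)")
```

The enumeration order differs from the numbering in the text, but the six cases are the same.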


Let T_{3i}(t) denote the probability that the i-th event itemized in the
previous paragraph actually takes place. Then:

    R_3(t) = \sum_{i=1}^{6} T_{3i}(t)    (19)

In order to derive expressions for the T_{3i}(t) terms, it is convenient to define
some additional terms analogous to those used in Section II. Specifically, let

    [equations (20), (21), and (22), defining U_x(\tau_1, \tau_2, t),
    V_x(\tau_1, \tau_2, t), and W_x(\tau_1, \tau_2, t); illegible in source]

and let

    [equation (23), defining G_x(\tau_1, \tau_2, t); illegible in source]

These expressions denote the probability (density) that stage x survives
in mode 1 until time \tau_1, in mode 2 until time \tau_2, and in mode 3 until
time t. They differ in that U_x(\tau_1, \tau_2, t) represents the case in which the
first mode change is due to a depletion of spares in stage x itself,
V_x(\tau_1, \tau_2, t) the case in which x causes the second mode change,
W_x(\tau_1, \tau_2, t) the case in which it causes both changes, and
G_x(\tau_1, \tau_2, t) the case in which neither of the changes is due to a spares
deficiency in stage x. These expressions are entirely analogous to those
described in Section II. As there, these expressions also assume that all spares
not already known to be defective are tested at each mode change. The terms
[symbols illegible in source] are as defined in Section II (the second subscript
refers to the mode), and [symbol illegible in source] = H_{xi}(\tau) with S_x
replaced by [term illegible in source]. (When i = 2 here, the coverage terms
become C_x'' and \delta_x'', the double "primes" indicating that the mode 2 to
mode 3 transitional values of these terms are to be used.) Again, as in Section
II, the "prime" on an integer indicates that this parameter is to be
incremented as appropriate if active units are reassigned to the spares pool.
(E.g. in the expression for U_x(\tau_1, \tau_2, t), i' = i if units are not
reassigned and i' is i plus an increment [illegible in source] if they are.
Similarly, k' is k plus an increment [illegible in source] in the expression for
V_x(\tau_1, \tau_2, t) if the active units of stage x are reassigned at the time
\tau_1, and k' = k otherwise.) The parameter r_1 is also as previously defined
and r_2 is the corresponding term when the transition is from mode 2 to mode 3:
r_2 = 0 if units are not reassigned; r_2 = Q_{x2} - Q_{x3} - 1 if they are.

The probabilities T_{3i}(t) can now be expressed in terms of these functions:
The extension to four and more modes is straightforward but the resulting
expressions become correspondingly more complex. For four modes the reliability
assumes the form

    R(t) = (R_1(t) + R_2(t) + R_3(t)) e^{-(\lambda_4 + \lambda_5) t} + R_4(t) e^{-\lambda_5 t}    (25)

and for five modes

    R(t) = (R_1(t) + R_2(t) + R_3(t)) e^{-(\lambda_4 + \lambda_5 + \lambda_6) t}
           + R_4(t) e^{-(\lambda_5 + \lambda_6) t} + R_5(t) e^{-\lambda_6 t}    (26)

etc. The terms R_1(t), R_2(t), and R_3(t) are as previously defined. The
probability R_4(t) involves twenty-two terms similar to the T_{3i}(t) terms just
defined but with many of these terms triple integrals; R_5(t) requires
seventy-three such terms, many of which are quadruple integrals.
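
The three-, four-, and five-mode expressions share a pattern that can be written as one routine. The sketch below extrapolates that pattern to m modes; the extrapolation, the function name, and the numbers are assumptions rather than results stated in the report, and the mode terms R_i(t) are taken as already evaluated.

```python
from math import exp


def multimode_reliability(mode_terms, category_rates, t):
    """Combine R_1(t)..R_m(t), m >= 3, with the higher-category failure rates.

    mode_terms[i - 1] is R_i(t) evaluated at time t;
    category_rates[c] is lambda_c for c = 4, ..., m + 1.
    """
    m = len(mode_terms)
    if m < 3:
        raise ValueError("this sketch covers three or more modes")

    def no_failures_from(first_category):
        # Probability of no failures of categories first_category .. m + 1 by time t.
        rate = sum(category_rates[c] for c in range(first_category, m + 2))
        return exp(-rate * t)

    total = sum(mode_terms[:3]) * no_failures_from(4)
    for i in range(4, m + 1):
        total += mode_terms[i - 1] * no_failures_from(i + 1)
    return total


# Hypothetical four-mode example.
r_four = multimode_reliability([0.5, 0.2, 0.1, 0.05], {4: 0.1, 5: 0.2}, t=1.0)
```

Each mode term is multiplied by the probability that none of the failure categories that would preclude that mode has occurred.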

While the CARE II program could in principle be extended to include more
than two modes, it is apparent from the preceding discussion that the time
required to calculate R(t) for any particular set of parameters is an
exponentially increasing function of the number of modes allowed. If additional
restrictions were imposed on the model, however, the time needed to complete a
computation could be brought back to reasonable values. If, for example, it
could be established that failures of category two and higher were sufficiently
unlikely to be ignored, the complexity of the computation could decrease
dramatically. Only two of the six T_{3i}(t) terms required in the evaluation of
R_3(t), for example, would remain; R_4(t) would involve only five triple integrals
and R_5(t) fifteen quadruple integrals rather than the seventy-three previously
mentioned. In addition, the number of terms would reduce still further (and the
resulting terms would be simpler) if the number of stages comprising the
computer system were restricted. (In the above discussion it was assumed that the
number of stages was at least as great as the number of modes minus one, i.e.
three stages in a 4-mode configuration, etc.) If only two stages were allowed,
for example, the number of terms needed to evaluate R_5(t) would be reduced
further from fifteen to eight; and only one such term is required to model
a single-stage system.
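
The reduced term counts quoted above (two, five, and fifteen, then eight with two stages and one with a single stage) coincide with the number of ways of grouping the mode changes according to which stage caused them, i.e. the number of partitions of the m - 1 mode changes into at most n groups. This reading is an observation offered here as an assumption; it is not stated in the report. A short sketch:

```python
def stirling2(n, k):
    """Stirling number of the second kind: partitions of n items into k non-empty blocks."""
    table = [[0] * (k + 1) for _ in range(n + 1)]
    table[0][0] = 1
    for a in range(1, n + 1):
        for b in range(1, min(a, k) + 1):
            table[a][b] = b * table[a - 1][b] + table[a - 1][b - 1]
    return table[n][k]


def reduced_term_count(modes, stages):
    """Groupings of the (modes - 1) mode changes by originating stage."""
    changes = modes - 1
    return sum(stirling2(changes, blocks)
               for blocks in range(1, min(changes, stages) + 1))


for modes in (3, 4, 5):
    print(modes, "modes, unrestricted stages:", reduced_term_count(modes, modes - 1))
print("5 modes, two stages:", reduced_term_count(5, 2))
print("5 modes, one stage:", reduced_term_count(5, 1))
```

If the reading is right, the unrestricted count grows as the Bell numbers, which is consistent with the rapid growth in computation time described above.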



VI. CONCLUSIONS AND RECOMMENDATIONS 


The CARE II program as it currently exists is an extremely versatile
tool for modeling the reliability of a dual-mode computer system. The computer
can be segregated into as many as eight different stages, each with its own
coverage parameters, active and dormant failure rates, and its own complement
of spares. A mode change can result upon the exhaustion of spares in any
stage or from the inability to operate in a dual mode even when adequate units
are available at each stage (category two failures).

As shown in Section V, the extension of this model to include systems
capable of operating in three or more modes is conceptually straightforward.
Unfortunately, the resulting computational time can become excessive unless
some restrictions are placed on the generality of the model. Several
possibilities were identified in this regard.

Two approaches suggest themselves for implementing this extended version
of CARE II (with or without these additional restrictions). The obvious approach
is to use the present CARE II to calculate R_1(t) and R_2(t) and augment it with
new subroutines to determine R_3(t), R_4(t), etc. The major difficulty with this
approach is the excessive time required to evaluate the resulting multiple
integrals. (In general, an \ell-mode model involves (\ell - 1)-fold integrals.)
An alternative approach using Laplace transforms to eliminate the need for any
integration appears promising. Further investigation is needed to determine the
ease with which the consequent inverse transforms can be evaluated by computer
and to estimate the complexity of the resulting program as a function of the
number of modes in the reliability model.

In any event, the extended program would consist primarily of a subroutine
for evaluating R_\ell(t) with \ell the maximum number of modes likely ever to be
required (e.g. \ell = 5). The terms R_{\ell-1}(t), R_{\ell-2}(t), ... would be
determined by repeatedly using this same subroutine with appropriate substitutions.

That is, suppose a subroutine were written to calculate:

    R_3(t) = \sum_{i=1}^{6} T_{3i}(t)

with the T_{3i}(t) terms as defined in equation 24. Then:

    R_2(t) = (T'_{31}(t) + T'_{33}(t)) e^{[exponent illegible in source]}

where T'_{3i}(t) = T_{3i}(t) with [term illegible in source] replaced by
S_y(\tau_1, \tau_2, t) \delta([argument illegible in source]).

Similar substitutions would yield R_1(t) and the desired result would be obtained
after three (or, in general, \ell) successive iterations using this same subroutine.


APPENDIX 


A PROOF OF EQUATION 5

The recursion

    G(m + 1, t) = \sum_{i=0}^{m} \int_0^t KQ\mu G(i, \tau) (1 - e^{-\mu\tau})^{m-i} e^{-\mu\tau} C\delta^{m-i} e^{-KQ\mu(t - \tau)} d\tau    (27)

follows directly from the definition of G(i,t). That is, exactly m + 1 spares are
used by time t if a failure occurred in the infinitesimal time interval
(\tau, \tau + d\tau) (an event having probability KQ\mu d\tau), if exactly i spares
had been used up to that point (G(i, \tau)), if the first m - i spares tested are
defective ((1 - e^{-\mu\tau})^{m-i}) but the (m - i + 1)st spare is operational
(e^{-\mu\tau}), if recovery is possible under these conditions (C\delta^{m-i}), and
if the system survives for the remaining t - \tau units of time without any
additional failures (e^{-KQ\mu(t - \tau)}). Integrating the resulting probability
density over all 0 \le \tau \le t yields the conditional probability that exactly
m + 1 spares are used given that m - i + 1 spares had to be tested during recovery
from the last failure in that interval. Since, for any m \ge 0, i must be an
integer in the range 0 \le i \le m, since these events are mutually exclusive for
different integers i, and since at least one failure must have occurred for any
m \ge 0, the sum of these probabilities over all i, 0 \le i \le m, must equal
G(m + 1, t).
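
As a worked check of the recursion, the first step (m = 0) can be carried out by hand. The starting value G(0, t) = e^{-KQ\mu t}, the probability that no active unit has failed by time t, is an assumption here, since equation 5 itself is not reproduced in this section.

```latex
\begin{align*}
G(1,t) &= \int_0^t KQ\mu \, G(0,\tau)\, e^{-\mu\tau}\, C\, e^{-KQ\mu(t-\tau)} \, d\tau \\
       &= KQC\, e^{-KQ\mu t} \int_0^t \mu e^{-\mu\tau} \, d\tau \\
       &= KQC \left(1 - e^{-\mu t}\right) e^{-KQ\mu t}.
\end{align*}
% Limiting case: K \to \infty with K\mu = \lambda held fixed gives
% K(1 - e^{-\mu t}) \to \lambda t, so that
% G(1,t) \to C Q \lambda t \, e^{-Q\lambda t}.
```

The result has the form (1 - e^{-\mu t})^{m+1} e^{-KQ\mu t} times a constant, as the induction requires, and for K = 1, Q = 2 the sum G(0,t) + G(1,t) reproduces the per-stage factor P^2 [1 + 2C(1 - P)] used in the multi-stage example of Section IV.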


Now assume that equation (5) is true for all G(i, t) with i \le m. Then
substituting this expression for G(i, t) into equation (27) and rearranging
terms, we obtain:

    G(m + 1, t) = [coefficient and intermediate steps illegible in source] (1 - e^{-\mu t})^{m+1} e^{-KQ\mu t}

Thus, since equation (5) is obviously true for i = 0, it is also true,
by recursion, for all integers i > 0.





